Abstract


To preserve the consistency of on-line, long-lived, distributed data in the presence of 
concurrency and in the event of hardware failures, it is necessary to ensure atomicity and 
data resiliency in applications. The programming language Argus is designed to support 
such applications. This thesis investigates the mechanism needed to support the notion of 
data resiliency present in Argus. Data resiliency means that the probability is very high that 
the crash of a node or storage device in a distributed system does not cause the loss of vital 
data. Data resiliency requires the use of stable storage devices, memory devices that
survive failures with high probability. This thesis is not concerned with how to implement
stable storage devices, but rather with how to organize the use of stable storage. The thesis 
presents a new organization of stable storage called the hybrid log that provides fast writing 
of information to stable storage and reasonably fast recovery of information from stable 
storage. In the context of this scheme, various algorithms are developed for writing objects
to the log, recovering objects from the log, and housekeeping the log.


Thesis supervisor: Barbara H. Liskov 
Title: Professor of Computer Science and Engineering 


Keywords: Atomic actions, atomic objects, distributed systems, logs, recovery, shadowing, 
stable storage, transactions 


Table of Contents 


Chapter One: Introduction 


1.1 Stable Storage 

1.2 Organizing Stable Storage 
1.2.1 Logging versus Shadowing 
1.2.2 The Approach 

1.3 Related Work 
1.3.1 System R Recovery Manager 
1.3.2 Swallow 

1.4 Outline of thesis 


Chapter Two: Background 


2.1 The Programming Language Argus 
2.2 Two-phase Commit Protocol 
2.2.1 The Coordinator 
2.2.2 The Participant 
2.2.3 Effects of crashes on Two-phase commit 
2.3 The Recovery System 
2.4 Recoverable Objects 
2.4.1 Atomic Objects 
2.4.2 Mutex Objects 
2.4.3 Incremental Copying Algorithm 


Chapter Three: Simple Log -- Writing and Recovery Algorithms 


3.1 Log abstraction interface to stable storage 
3.2 Structure of the simple log 
3.3 Writing objects to the log 
3.3.1 The Coordinator 
3.3.2 The Participant 
3.3.3 Writing data entries 
3.3.3.1 Copying Data 
3.3.3.2 What to Write 
3.3.3.3 The Writing Algorithm 
3.4 Recovering objects from the log 
3.4.1 Sketch of the General Algorithm 
3.4.2 Log Scenarios
3.4.3 Turning uids into pointers 
3.4.4 The General Recovery Algorithm 


Chapter Four: Hybrid Log -- Writing and Recovery Algorithms 
4.1 Simple log versus Hybrid log 




4.2 Writing objects to the log 

4.3 Recovering objects from the log 
4.3.1 Sketch of the General Algorithm 
4.3.2 Log Scenario and Recovery 
4.3.3 The General Recovery Algorithm 

4.4 Early prepare 


Chapter Five: Hybrid Log -- Housekeeping Algorithms 


5.1 Compacting the log 
5.1.1 The Compaction Algorithm 
5.1.2 The New Recovery Algorithm 
5.2 Taking a snapshot of the stable state
5.3 Summary 


Chapter Six: Conclusions 


References 


Table of Figures 


Figure 1-1: Shadowed objects 

Figure 2-1: The Recovery System 

Figure 2-2: An Atomic Record 

Figure 3-1: Data entries and Outcome entries 

Figure 3-2: Format of recoverable objects in volatile memory 
Figure 3-3: Objects in volatile memory 

Figure 3-4: Flattened Object 

Figure 3-5: Newly Accessible Objects Example 

Figure 3-6: Newly Accessible Objects 

Figure 3-7: Log of atomic objects after a crash

Figure 3-8: Log of mutex objects following a crash 
Figure 3-9: Log following a crash 

Figure 3-10: Coordinator’s log following a crash 

Figure 4-1: New format of log entries 

Figure 4-2: Log after the prepare phase 

Figure 4-3: Hybrid log after T1 prepares and T2 commits 




Figure 1-1: Shadowed objects 


1.2.2 The Approach 


Let us summarize the advantages and disadvantages of these two schemes: 
1. Log => fast writing, but slow recovery 
2. Shadowing => slow writing, but fast recovery 

In comparing the two approaches we assume that crashes do not happen very often
and that we would like normal processing to be fast at the possible expense of a slow
recovery after a crash.

For reasons to be discussed in later chapters, we have chosen an organization of
stable storage that falls between these two extremes, which we call the hybrid log. As the
name suggests, it is a hybrid of the pure log and the shadowing schemes that combines the
advantageous characteristics of both schemes. Hence, writing is almost as fast as the pure
log, and recovery is faster than the pure log scheme but not quite comparable with the
shadowing scheme. The map in the shadowing scheme is now written incrementally to the 
hybrid log and is distributed over the entire log; this means that the extra cost associated 
with updating the map at every action commit in the shadowing scheme is just part of the 
cost of writing entries to the log. 

Given this hybrid organization, we have also developed three kinds of algorithms: (1)
writing objects to the hybrid log, (2) recovering objects from the hybrid log, (3) and

shadowing schemes considered alone. We then explain the writing and recovery algorithms
for the hybrid log. Finally, we point out the complications introduced by the notion of early 
prepare. 

Chapter 5 considers the problem of reorganizing the hybrid log to make recovery from 
crashes more efficient. Two methods are discussed and compared: log compaction and 
stable state snapshot. 

Finally, in Chapter 6 we summarize the foregoing, draw conclusions, and suggest
directions for further research.
If a coordinator crashed before the committing record was written to stable storage
for some committing action, then it will remember nothing about the action after recovery,
and the action will be aborted. If the coordinator receives a query about the action from a
participant, it will tell the participant to abort the action. When the committing record
appears in stable storage the action has really committed; this entry marks the point of no
return for the coordinator, after which it must commit. Suppose, however, a coordinator
crashed after the committing record was written to stable storage, but before the done
record was written. Then upon recovery the action is still committing and the recovery
system restores the guardian’s state as it had been before the crash.

If a coordinator crashed after the done record was written to stable storage for some
committing action, then this action has completed and nothing special need be done.


2.3 The Recovery System 


Operations that the Argus system calls on the recovery system:
prepare(aid, MOS), commit(aid), abort(aid), committing(aid, gids), done(aid), housekeeping


Figure 2-1: The Recovery System 


The job of the recovery system is to write information to stable storage as needed by 
two-phase commit, to restore a guardian’s stable state after a crash, and to reorganize 
stable storage in order to make recovery more efficient. The recovery system provides 
operations that the Argus system calls at appropriate times in order to carry out these tasks. 
See Figure 2-1. The Argus system itself is distributed, every guardian containing a portion of 
it; the recovery system also exists at each guardian and is called by the portion of the Argus 
system at that guardian.

-recoverable object but not any contained recoverable objects; these will be
copied separately if they were modified. The sharing of objects is preserved 
only for shared recoverable objects or for a group of unrecoverable objects 
entirely contained within a recoverable object. 


3. Once the recoverable object, including its contained non-recoverable objects, 
has been copied, the recovery system releases possession and continues. 


To copy a recoverable object, the system invokes a routine that linearizes (or flattens) 
the data in the modified object and in any contained non-recoverable objects. Any 
references to other recoverable objects are translated from their volatile addresses to their 
corresponding stable storage references. Figure 2-2 illustrates this technique. In copying
the object referred to by variable z, we copy x but not y (since y is atomic but x is not);
instead, we place a stable storage reference for y in the copy of z, and copy y separately if
necessary (if it was modified or was new).


z: atomic record[x: int, y: atomic array[int]] 
Figure 2-2: An Atomic Record 


In short, the system gains possession of each recoverable object that had been 
modified by the action, copies it, releases possession, and continues. 
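
The flattening step for the record z of Figure 2-2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the classes Recoverable and StableRef and the function names are hypothetical, a stable storage reference is modeled as nothing more than the target's uid, and the preservation of sharing among contained non-recoverable objects is not modeled.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Recoverable:
    """Hypothetical stand-in for a recoverable (atomic or mutex) object."""
    uid: int
    value: Any  # ints, lists, dicts, or other Recoverable objects


@dataclass(frozen=True)
class StableRef:
    """Hypothetical stable storage reference: here just the target's uid."""
    uid: int


def flatten(value):
    """Copy `value`, replacing contained recoverable objects by references."""
    if isinstance(value, Recoverable):
        # Do not descend: the contained object is copied separately if needed.
        return StableRef(value.uid)
    if isinstance(value, dict):
        return {key: flatten(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [flatten(item) for item in value]
    return value  # plain data such as an int is copied as is


def copy_recoverable(obj):
    """Flatten the data portion of one recoverable object."""
    return flatten(obj.value)


# z: atomic record[x: int, y: atomic array[int]]
y = Recoverable(uid=2, value=[10, 20, 30])
z = Recoverable(uid=1, value={"x": 7, "y": y})
flat_z = copy_recoverable(z)
```

Here the copy of z carries the integer x directly and a reference in place of y; y itself would be flattened by a separate call if it had been modified or was new.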
(uid) of the recoverable object, (2) the object type--mutex or atomic, (3) the object value, and
(4) the action identifier (aid) of the top-level action that is preparing. The object "value" is
not the actual object itself residing in volatile memory but a copy of the object’s version.


Data entry: object uid, object type, object value

Outcome entries for participants: prepared, committed, aborted, base_committed, prepared_data

Outcome entries for coordinators: committing, done


Figure 3-1: Data entries and Outcome entries 


The object’s unique identifier is some identifier that will never be reused and is unique 
with respect to the object's guardian. Since this identifier will not serve any other purpose 
except to distinguish recoverable objects from one another, the unique object generator can 
be a stable counter associated with each guardian, that is, an integer that is incremented 
whenever a recoverable object needs a uid. There is no danger of a uid being reused after a 
crash because the recovery system knows after recovery of each guardian the last uid that 
was generated and assigned to a recoverable object at that guardian; the stable counter can 




either a coordinator or a participant; thus, a guardian’s log could contain outcome entries
for a coordinator when the guardian acts as coordinator and for a participant when the 
guardian behaves like a participant. 

We will elaborate further on these different outcome entries in the next several
sections when we discuss the writing of objects to the log.

3.3 Writing objects to the log 


Recoverable objects are written to the log only when top-level actions commit. To
ensure that the effects of top-level actions are made permanent, the system goes through the
standard two-phase commit protocol described in the previous chapter.

3.3.1 The Coordinator 


After sending out prepare messages to all the participants (including itself since it is 
also a participant), the coordinator waits for replies. If any participant replies aborted, or if 
the coordinator aborts unilaterally, then the coordinator tells the participants to abort via 
abort messages. If it hears from every participant that it has prepared, it starts the
committing phase.

If all participants respond prepared, the recovery system creates a committing
outcome entry and forces it to the coordinator’s log. (Whenever we say that a log entry is
forced to the log, we mean that the force_write operation on the log object is invoked with
the log entry.) At this point the action is committed. The coordinator then sends commit
messages to all the participants (including itself), informing them of its verdict, and waits for
them to respond. When all have responded committed the coordinator creates a done
coordinator outcome entry and forces it to the coordinator's log. Two-phase commit is now
complete.
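
The coordinator's side of the protocol can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the Log class is an in-memory stand-in for the stable log, and the participant interface (prepare, commit, and abort methods returning strings) and the FakeParticipant test double are assumptions made for the example. Unilateral aborts by the coordinator, message loss, and retries are left out.

```python
class Log:
    """Hypothetical in-memory stand-in for the stable log."""

    def __init__(self):
        self.entries = []

    def force_write(self, entry):
        # A real log would return only once the entry is on stable storage.
        self.entries.append(entry)


def run_coordinator(aid, participants, log):
    """Drive two-phase commit for action `aid`.

    `participants` maps guardian ids to objects offering prepare, commit,
    and abort methods (assumed names); the coordinator is one of them.
    """
    votes = [p.prepare(aid) for p in participants.values()]
    if any(vote != "prepared" for vote in votes):
        for p in participants.values():
            p.abort(aid)
        return "aborted"

    # Point of no return: once this entry is forced, the action is committed.
    log.force_write(("committing", aid, sorted(participants)))

    replies = [p.commit(aid) for p in participants.values()]
    if all(reply == "committed" for reply in replies):
        log.force_write(("done", aid))
    return "committed"


class FakeParticipant:
    """Test double that always votes the way it is told to."""

    def __init__(self, vote):
        self.vote = vote

    def prepare(self, aid):
        return self.vote

    def commit(self, aid):
        return "committed"

    def abort(self, aid):
        return "aborted"
```

Note that nothing is written to the coordinator's log on the abort path, which matches the rule that a coordinator with no committing record for an action treats that action as aborted.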
3.3.2 The Participant 


When a participant receives a prepare message from the coordinator it prepares in
the following way. In general, for each object in the MOS the recovery system constructs 
data entries and writes them to the log. If the data entries were written successfully to the 




discussed in Chapter 2 on the data portion of the recoverable object, in particular, on the 
appropriate version (current or base version if the object is atomic, or the current version if 
the object is mutex). As the copy proceeds, the algorithm follows volatile memory 
references, replacing references to recoverable objects with their uids and simply copying 
any regular objects. The data is now flattened. The recovery system then creates a data 
entry containing the object uid, the action id of the action that is preparing, the object type, 
and the flattened data. It is this data entry that is written to the log.
Figure 3-3 shows a possible situation involving atomic, mutex, and regular objects. 




Figure 3-3: Objects in volatile memory 


Suppose object O,, which was modified by action T,, is to be copied to the log. The
incremental copying algorithm follows pointers in the data portion of the object. The
reference to object O, (a mutex object) is replaced with the uid O, itself. The algorithm
copies the regular object and in so doing discovers that it contains a reference to yet
another recoverable object, namely O14 an atomic object; it replaces the reference with O,
itself. And finally, the algorithm replaces the reference to object O,, an atomic object, with
the uid O,.

In flattened form, O, looks like Figure 3-4.



Figure 3-4: Flattened Object 


3.3.3.2 What to Write 


Having discussed the manner in which data is copied to the log as data entries, let us
consider the question of what actually gets written. As we mentioned before, we are
interested only in those recoverable objects that are accessible from the stable variables
because these make up the stable state of the guardian and only the stable state survives
crashes. Recall that, for each action, the Argus system keeps track of both modified objects
and newly created objects in the MOS and does not distinguish between objects accessible
from the stable variables and objects accessible from the volatile variables. It is the job of
the recovery system, then, to separate the objects in the MOS that are accessible from the
stable variables from those objects that are inaccessible and to write the accessible objects
to the log.

Notice that this concern with accessible objects is really an optimization because we
could simply write out all the recoverable objects at a guardian without regard for
accessibility or inaccessibility; if some inaccessible object were written out to stable storage
it would not matter since it was unreachable anyway, but it would clutter the log with
irrelevant information.


The Problem of Newly Accessible Objects 

Recoverable objects are either previously accessible from the stable variables or
newly accessible.

Let us consider previously accessible recoverable objects. If the previously
Figure 3-5: Newly Accessible Objects Example

Panels of the figure:
1. Initial situation
2. T2 modifies O1 -> O3
3. T3 modifies O2 -> O3
4. T2 modifies O3
5. T2 prepares
6. T3 prepares
7. After T2 aborts
8. After T3 commits
treated differently. Since the object is an atomic object that the action has a
read lock on (and thus there is only a single version), the recovery system
creates an outcome entry, base_committed, consisting of object uid O, and the
copied object version. The recovery system writes the entry to the log, deletes
object O, from the NAOS, and inserts uid O, into the AS.


6. The NAOS is empty, so the recovery system is done. It has determined which of 
the objects in the MOS were accessible and has written the corresponding data 
entries to the log. It forces a prepared outcome entry to the log. 


7. The AS now consists of object uids O;; oe O,: 


a. T1 gets write lock on O2
b. T1 modifies O2 to point to O3

Figure 3-6: Newly Accessible Objects


Notice that there are two phases. First, the recovery system processes every object in 
the MOS (which was one of the two arguments in the call of the prepare operation), copying 
current object versions and writing data entries to the log as it goes along. As these object 
versions are copied, recoverable objects not previously accessible (that is, their uids are not 
already in the AS) may be revealed as newly accessible; these objects are placed in another 
set, the NAOS, consisting of just newly accessible objects. 
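
The bookkeeping of the MOS, the NAOS, and the AS can be sketched as a worklist algorithm. The sketch below covers both phases: processing the MOS and then draining the NAOS. It rests on several assumptions that are not taken from the thesis: objects are described by a dictionary giving their type, an already flattened value, and the uids they reference; a modified object that is not yet accessible is written only if it turns up in the NAOS; and every newly accessible object that the action did not modify is recorded with a base_committed entry. The log is a plain list.

```python
from collections import deque


def prepare(aid, mos, accessible, objects, log):
    """Write entries for action `aid`; `accessible` (the AS) is updated in place.

    mos        -- set of uids modified or created by the action
    accessible -- set of uids reachable from the stable variables (the AS)
    objects    -- uid -> {"type": ..., "value": ..., "refs": [uids]} (assumed layout)
    log        -- list standing in for the stable log
    """
    naos = deque()  # newly accessible objects still to be processed

    def note_refs(uid):
        for ref in objects[uid]["refs"]:
            if ref not in accessible and ref not in naos:
                naos.append(ref)

    def data_entry(uid):
        obj = objects[uid]
        return ("data", uid, obj["type"], obj["value"], aid)

    # Phase 1: modified objects that were already accessible.
    for uid in sorted(mos):
        if uid in accessible:
            log.append(data_entry(uid))
            note_refs(uid)

    # Phase 2: objects revealed as newly accessible.
    while naos:
        uid = naos.popleft()
        if uid in mos:
            log.append(data_entry(uid))
        else:
            log.append(("bc", uid, objects[uid]["value"]))
        accessible.add(uid)
        note_refs(uid)

    log.append(("prepared", aid))  # forced in a real system


# Situation modeled on Figure 3-6: T1 makes O2 point to O3.
objects = {
    1: {"type": "atomic", "value": 5, "refs": []},
    2: {"type": "atomic", "value": ("ref", 3), "refs": [3]},
    3: {"type": "atomic", "value": 0, "refs": []},
}
accessible = {1, 2}
log = []
prepare("T1", {2}, accessible, objects, log)
```

In this run O2 is written as a data entry, O3 is discovered through O2 and written as a base_committed entry, and the AS grows to include O3 before the prepared entry is appended.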

Second, when the recovery system has processed the MOS, it then proceeds to 
process the NAOS, if it is not empty. After each object is processed it is deleted from the 
NAOS and added to the AS. Other recoverable objects may become newly accessible and 
3.4.2 Log Scenarios


Scenario 1--atomic objects 


Suppose the situation depicted in Figure 3-7 exists at a participant's stable log after a
crash.

+----+----+----+--------+---------+----+--------+
| bc | bc | O2 |prepared|committed| O1 |prepared|
| O1 | O2 | at |        |         | at |        |
| V1 | V2 | V2 |        |         | V2 |        |
|    |    | T1 |   T1   |   T1    | T2 |   T2   |
+----+----+----+--------+---------+----+--------+
^                                               ^
log's beginning                         log's end


Figure 3-7: Log of atomic objects after a crash 


In this figure (and all figures of this sort) the beginning of the log is on the left and the end of
the log is on the right; the log grows to the right. The symbols in the log depicted have the
following meaning. T1 and T2 are actions. Action T1 has committed; action T2 has
prepared. O1 and O2 represent unique object identifiers; and V1 and V2 are the object
values, that is, the versions of objects.

Let us develop some notation to make it easier to talk about data entries and outcome
entries in a log. Let data entries be represented as quadruples:
<object uid, object type, object version, action identifier>

so a data entry might look like <O1,atomic,V1,T1>, where O1 is the object uid, atomic
indicates that the object version is atomic, V1 is the object version, and T1 is the action id.

Let us represent outcome entries as pairs:
<outcome, action identifier>

and so the first two outcome entries would look like <prepared,T1> and <committed,T1>. The
only exception is committing, which also includes a list of guardian ids. Furthermore, we
represent the special outcome entries in the following way:
<bc, object uid, object version>

where "bc" is short for base_committed, and
<pd, object uid, object version, action id>

where "pd" is short for prepared_data.
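
For readers who prefer code to tuples, the notation above maps directly onto small record types. The following Python sketch is one possible encoding; the class and field names are chosen for this example and do not come from the thesis.

```python
from typing import Any, NamedTuple, Tuple


class DataEntry(NamedTuple):
    """<object uid, object type, object version, action identifier>"""
    uid: str
    otype: str      # "atomic" or "mutex"
    version: Any
    aid: str


class OutcomeEntry(NamedTuple):
    """<outcome, action identifier>; only committing carries guardian ids."""
    outcome: str    # prepared, committed, aborted, committing, or done
    aid: str
    gids: Tuple[str, ...] = ()


class BaseCommitted(NamedTuple):
    """<bc, object uid, object version>"""
    uid: str
    version: Any


class PreparedData(NamedTuple):
    """<pd, object uid, object version, action id>"""
    uid: str
    version: Any
    aid: str


# A small hypothetical log, oldest entry first.
example_log = [
    BaseCommitted("O1", "V1"),
    DataEntry("O1", "atomic", "V2", "T1"),
    OutcomeEntry("prepared", "T1"),
    OutcomeEntry("committing", "T1", ("G1", "G2")),
]
```

Because named tuples compare equal to plain tuples with the same fields, entries written this way can still be read as the quadruples and pairs used in the text.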
At algorithm's end, the PT and OT contain the following information.


PT                  OT
T1  committed       O1  restored  vm address
T2  aborted         O2  restored  vm address
T3  committed       O3  restored  vm address


Notice that the stable state of the guardian in volatile memory following recovery will
look exactly like the situation that existed before the crash in Step 8 of Figure 3-5, which is
what we wanted.


Scenario 4 


Suppose the situation depicted in Figure 3-10 exists at a guardian’s log, after a crash. 


+----+----+----+--------+---------+----+--------+----------+---------+----+
| bc | O1 | bc |prepared|committed| O2 |prepared|committing|committed|done|
| O1 | at | O2 |        |         | at |        |          |         |    |
| V1 | V1 | V2 |        |         | V2 |        | P1,P2,P3 |         |    |
|    | T1 |    |   T1   |   T1    | T2 |   T2   |    T2    |   T2    | T2 |
+----+----+----+--------+---------+----+--------+----------+---------+----+
^                                                                         ^
log's beginning                                                   log's end

Figure 3-10: Coordinator's log following a crash 


In this scenario we show the entries that are written to the log for the coordinator of an 
action during two-phase commit. 

To recover the objects from the guardian’s log in Figure 3-10, we need to extend the
algorithm to include coordinators. Let us add a third table, which stores information about
coordinator states. Thus,
CT: action id -> coordinator action state

where coordinator action state = {committing, done}. committing contains a list of the
guardian identifiers that were involved in the action.

Notice that in the guardian’s log a particular ordering of outcome entries holds true if 
the top-level action committed successfully: prepared, committing, committed, and done. 
Why? When each participant has prepared, it forces the prepared outcome entry to its log. 
The coordinator, upon hearing that everyone has prepared, forces the committing entry to 


of the participants and coordinators. 


Data entry: object value

Outcome entries for participants: prepared (with a list of <uid,log address> pairs), committed


Figure 4-1: New format of log entries 


the participant’s log for some action and internally keeps track of the object uids and the log 
addresses of the data entries. When it is finished, it creates a prepared outcome entry 
consisting of the list of <uid, log address> pairs and the log address of the previous outcome 
entry and forces the entry to the log. Notice that the recovery system must keep track of this 
information for every preparing action. The only other difference is that each of the other 
outcome entries is linked via the log pointer field to the previous outcome entry before it is 
forced to the log.
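
The writing scheme for the hybrid log can be sketched as follows. This is an illustration under assumptions: the log is an in-memory list whose indices serve as log addresses, entries are dictionaries, and the names HybridLog, write, force_outcome, and prepare are hypothetical. Each outcome entry carries a pointer to the previous outcome entry, and the prepared entry carries the <uid, log address> pairs of the data entries written for the action.

```python
class HybridLog:
    """Hypothetical in-memory hybrid log; list indices act as log addresses."""

    def __init__(self):
        self.entries = []
        self.last_outcome = None  # log address of the most recent outcome entry

    def write(self, entry):
        """Append an entry and return its log address."""
        self.entries.append(entry)
        return len(self.entries) - 1

    def force_outcome(self, kind, aid, pairs=None):
        """Write an outcome entry linked to the previous outcome entry."""
        entry = {"kind": kind, "aid": aid, "prev": self.last_outcome}
        if pairs is not None:
            entry["pairs"] = list(pairs)
        address = self.write(entry)  # a real log would force this to disk
        self.last_outcome = address
        return address


def prepare(log, aid, versions):
    """Write one data entry per (uid, value), then the prepared outcome entry."""
    pairs = []
    for uid, value in versions:
        address = log.write({"kind": "data", "value": value})
        pairs.append((uid, address))
    return log.force_outcome("prepared", aid, pairs)


log = HybridLog()
prepared_at = prepare(log, "T1", [("O1", "V1"), ("O2", "V2")])
committed_at = log.force_outcome("committed", "T1")
```

With this layout the data entries need not carry uids themselves, since the prepared entry records where each object's version was written.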


Figure 4-2: Log after the prepare phase 


4.3 Recovering objects from the log 




In this section we present a sketch of the general recovery algorithm. One log
scenario demonstrates the manner in which the new recovery algorithm recovers objects
from the log. We then give a detailed explanation of the differences between this recovery
algorithm and the simple log’s recovery algorithm.

4.3.1 Sketch of the General Algorithm 


1. Create three tables: (1) an object table (OT) that maps object uids to both object 
states (prepared or restored) and object locations in volatile memory, (2) a 
coordinator action table (CT) that maps action ids to coordinator action states 
(committing and done, where committing also has a list of guardian ids of 
guardians involved in the action), and (3) a participant action table (PT) that 
maps action ids to participant action states (prepared, committed, and aborted). 


2. Read the log backwards, starting with the last outcome entry in the log. For 
every outcome entry on the backward chain of outcome entries, process it in the 
following way: 


a. If the outcome entry is aborted, committed, committing, or done then fill 
the three tables with appropriate information (action ids and action states 
like prepared). 


b. If the outcome entry is a prepared entry, then for each <uid, log address> 
pair in the entry check the OT and determine whether or not to copy the 
object version into volatile memory; if it needs to copy the object version it 
follows the log address pointer to the data entry itself. 
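
The backward pass described in these steps can be sketched as below. The sketch makes simplifying assumptions that go beyond the text: the log is a list of dictionaries indexed by log address, each outcome entry has a "prev" field pointing to the previous outcome entry, only versions written by committed actions are restored, the first version found for a uid while reading backwards is taken as the latest, and the OT stores the restored value in place of a volatile memory address. The treatment of base versions, of actions that are still prepared, and of mutex objects is omitted.

```python
def recover(entries):
    """Return (pt, ct, ot) built by following the backward outcome chain."""
    pt = {}  # action id -> participant state
    ct = {}  # action id -> coordinator state
    ot = {}  # object uid -> restored value (stand-in for a vm address)

    # Assumption: the last non-data entry is the head of the outcome chain.
    address = next(
        (i for i in range(len(entries) - 1, -1, -1)
         if entries[i]["kind"] != "data"),
        None,
    )

    while address is not None:
        entry = entries[address]
        kind, aid = entry["kind"], entry["aid"]
        if kind in ("committed", "aborted"):
            # Later entries are seen first, so keep the first state recorded.
            pt.setdefault(aid, kind)
        elif kind in ("committing", "done"):
            ct.setdefault(aid, kind)
        elif kind == "prepared":
            state = pt.setdefault(aid, "prepared")
            if state == "committed":
                for uid, data_address in entry["pairs"]:
                    if uid not in ot:
                        ot[uid] = entries[data_address]["value"]
        address = entry["prev"]
    return pt, ct, ot


entries = [
    {"kind": "data", "value": "V1"},
    {"kind": "prepared", "aid": "T1", "pairs": [("O1", 0)], "prev": None},
    {"kind": "committed", "aid": "T1", "prev": 1},
    {"kind": "data", "value": "V2"},
    {"kind": "prepared", "aid": "T2", "pairs": [("O1", 3)], "prev": 2},
    {"kind": "aborted", "aid": "T2", "prev": 4},
    {"kind": "data", "value": "V3"},
    {"kind": "prepared", "aid": "T3", "pairs": [("O2", 6)], "prev": 5},
]
pt, ct, ot = recover(entries)
```

Only the data entry of the committed action T1 is ever read; the data entries of the aborted action T2 and of the still-prepared action T3 are never followed, which is where the hybrid log saves work relative to scanning every entry.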




Figure 4-3: Hybrid log after T1 prepares and T2 commits 


information and forced it to the log for action T;. 


7. The participant received a commit message for T, from its coordinator. The 
recovery system created the committed outcome entry with the proper 
information and forced it to the log. 


8. The Argus system crashed. 

On recovery we see that the earlier version, rather than the latest version, of O, gets 
copied to volatile memory, which is wrong. To solve this problem, we need to keep some 
extra information in the OT for mutex objects, namely, the log address of the "latest" data 
entry for that object that had been copied from the log. When we encounter another data 
entry for that object, we compare its log address with the one stored in the OT. If the new 
address is less than the old one, then the recovery system ignores the entry. If the new 
address is greater, then the recovery system copies the object version in the data entry to 
volatile memory and updates the OT with this data entry's log address. Also, the vm address 
field is updated with the new address of the object version. 
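
The comparison of log addresses for mutex objects can be sketched in a few lines. The function name restore_mutex and the layout of an OT entry are assumptions for the example, and the restored value again stands in for the vm address. The text does not say what to do when the two addresses are equal; the sketch treats that case as the same entry and ignores it.

```python
def restore_mutex(ot, uid, log_address, version):
    """Keep only the version from the highest log address seen for `uid`.

    Returns True if the version was copied, False if the entry was ignored.
    """
    current = ot.get(uid)
    if current is not None and log_address <= current["log_address"]:
        return False  # an equally recent or later version is already restored
    ot[uid] = {"log_address": log_address, "value": version}
    return True


ot = {}
first = restore_mutex(ot, "O1", 7, "v2")   # first entry seen for O1
second = restore_mutex(ot, "O1", 3, "v1")  # older entry, ignored
third = restore_mutex(ot, "O1", 9, "v3")   # newer entry, replaces v2
```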
accessible objects. The disadvantage of the snapshot is the space required for the MT and
the time used in keeping the MT up to date in volatile memory. The time required to update
the MT should be insignificant since the MT can be organized as a hash table; therefore,
only the space consumed by the MT is significant. We expect that it will be worthwhile to
trade this space for the time saved.
behave as they should. More work remains to be done, however. At one extreme is the
verification of the algorithms. We need to state precisely what the correctness properties
are for the algorithms and then verify that the algorithms preserve those properties. For
atomic objects the property is that the state of each object after a crash is exactly what is
obtained from running all actions that committed at a guardian in their serial order. For
mutex objects, however, the property is not so easy to state because of the semantics of
Argus, which require recovery of all mutex versions written for a prepared action.

At the other extreme is a real implementation of the recovery system and its
algorithms. The system must then be run in support of "realistic" applications and its
performance measured. In this way we will be able to evaluate the efficiency of the
algorithms, and we will be able to validate or disprove the assumptions on which the
recovery system is based.

Finally, the recovery system is based on an abstraction of stable storage, the stable
log. This abstraction must be implemented using real storage devices in a way that provides
the needed reliability.